Linear Modelling

Author

Peter Levy

Published

June 23, 2025

The aim of this practical is to estimate the parameters of very simple linear models in the Bayesian way. We will use the ideas from the previous sessions about MCMC and prior distributions. We will show that estimating models in this way can be almost as quick and easy as the standard least-squares approach (e.g. lm). We will use the R package rstanarm, which has familiar syntax but uses an MCMC algorithm “under the hood”, and uses sensible generic prior distributions by default. The application is to a data set on dimensions of individual trees, and how different attributes scale in relation to each other.

Work through this practical at your own pace. You can ask one of the trainers for help at any time. We will reconvene every 15 minutes or so to explore your findings and address common questions and issues.

Tree allometry

Allometry is the study of how the characteristics of living things change with their size [1]. The use of allometry is widespread in forestry and forest ecology. Allometric relationships are used to estimate attributes of trees that are difficult to measure, such as volume or mass, from more easily measured attributes such as diameter at breast height (dbh). For example, we might want to estimate the total above-ground biomass (and thus carbon sequestration) of a forest. Measuring the actual mass of all the trees non-destructively is impossible. However, we can measure the diameter (and other properties) of the trees, and use this to predict the total biomass with an allometric relationship.

We know there will be allometric relationships from the basic geometry (tall trees are generally wider), but other factors come into play which affect tree growth form and wood density. These include species, spacing between trees, climate and site location. Here, the aim is to model these allometric relationships using linear models.

Tree allometry data

First, we read in some data for UK conifer species (Levy, Hale, and Nicoll 2004). This contains measurements of diameter at breast height (dbh in cm) and total above-ground biomass (tree_mass in kg) for 1901 trees.

Run the following code in R. You can use the clipboard symbol below to copy and paste.

df <- read.csv(url("https://raw.githubusercontent.com/NERC-CEH/beem_data/main/trees.csv"))
names(df)
 [1] "tree_number"   "age"           "species"       "cultivation"  
 [5] "soil_type"     "tree_height"   "timber_height" "dbh"          
 [9] "tree_mass"     "stem_mass"     "taper"         "crown_mass"   
[13] "root_mass"    
dim(df)
[1] 1901   13

It is common practice to transform the diameter at breast height to a cross-sectional area (\(\pi (d/2)^2\), referred to as “basal area”), as this gives a more linear relationship with volume and mass.

df$basal_area <- pi * (df$dbh/2)^2

Plot the relationship between basal area and tree mass using your own code, or that below.

Code
library(ggplot2)
ggplot(df, aes(basal_area, tree_mass)) + geom_point()

Is there a reasonable linear relationship? If so, we can approximate the data with a linear model.
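As a quick numerical check alongside the plot, we can compute the correlation between the two variables (a base-R sketch; a high correlation is consistent with, but does not prove, linearity, so inspect the scatterplot too):

```r
# Pearson correlation between basal area and tree mass
# (a rough check of how well a straight line would fit)
cor(df$basal_area, df$tree_mass, use = "complete.obs")
```

A value near 0.93 would match the R-squared of around 0.87 reported by the least-squares fit later in this practical.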

Bayesian Estimation of The Linear Model

The standard linear model uses the equation:

\[y = \alpha + \beta x + \epsilon\]

where, in this case, \(y\) is the tree mass, \(x\) is the basal area, \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\epsilon\) is the unexplained residual or “error” term. \(\beta\) expresses the change in tree mass (in kg) with a unit increase in basal area (in cm²), so has units of kg/cm². \(\alpha\) is the hypothetical mass of a tree with zero basal area; you could interpret this as the mass of a tree just short of breast height (1.3 m). We assume \(\epsilon\) comes from a normal distribution with mean zero and standard deviation \(\sigma\).
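To make the error structure concrete, we can simulate data from this model with known parameters and check that fitting recovers them (a base-R sketch; the parameter values below are illustrative assumptions, not the fitted tree values):

```r
# Simulate data from y = alpha + beta * x + eps with known parameters
set.seed(42)
n     <- 1000
alpha <- -60    # intercept, kg (illustrative)
beta  <- 1.05   # slope, kg per cm^2 of basal area (illustrative)
sigma <- 88     # residual standard deviation, kg (illustrative)

x <- runif(n, 0, 2400)                          # basal area, cm^2
y <- alpha + beta * x + rnorm(n, 0, sigma)      # normal errors, mean zero

# A least-squares fit should recover alpha and beta approximately
coef(lm(y ~ x))
```

Re-running without `set.seed` shows how the estimates vary from sample to sample, which is exactly the uncertainty the posterior distributions below describe.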

We can estimate these model parameters in the Bayesian approach using the R package rstanarm [2] with the code below. The function stan_glm uses syntax familiar to R users, with a formula argument (the same as in lm and many other related model-fitting functions). However, it uses a Markov chain Monte Carlo (MCMC) algorithm [3] to estimate the parameters (instead of least squares or optimisation algorithms).

Load the libraries

Load the rstanarm library. ggeffects is used later for plotting output.

library(rstanarm)
library(ggeffects)
Fit the model

The syntax below specifies the model: \(\mathrm{tree \_ mass} = \alpha + \beta \times \mathrm{basal \_ area} + \epsilon\) and the data frame to use.

m <- stan_glm(tree_mass ~ basal_area, data = df, iter = 300, refresh = 50)
More iterations

Set refresh to 0 to remove progress reporting during the MCMC runs. Around 1000 iterations seems to work, but we can use several thousand because this simple model is very fast.

m <- stan_glm(tree_mass ~ basal_area, data = df, iter = 4000, refresh = 0)
Examine the output

In common with other model-fitting functions in R, the \(\beta\) regression coefficient, which acts as a multiplier on basal_area, is denoted simply as basal_area in the output; the \(\alpha\) coefficient is denoted as (Intercept). We can examine the trace plots from the iterations of the MCMC chains:

plot(m, "trace")
Output

and we can now get some summary output for the estimated posterior distribution:

summary(m, digits = 5)
Output

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      tree_mass ~ basal_area
 algorithm:    sampling
 sample:       8000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1901
 predictors:   2

Estimates:
              mean      sd        10%       50%       90%    
(Intercept) -59.19016   3.95276 -64.25698 -59.16876 -54.13814
basal_area    1.05402   0.00933   1.04202   1.05412   1.06600
sigma        87.96231   1.44239  86.12394  87.94919  89.84348

Fit Diagnostics:
           mean      sd        10%       50%       90%    
mean_PPD 324.73078   2.87349 321.04221 324.71036 328.39522

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse    Rhat    n_eff
(Intercept)   0.04512 1.00012 7675 
basal_area    0.00010 1.00059 8345 
sigma         0.01700 0.99964 7201 
mean_PPD      0.03318 1.00006 7501 
log-posterior 0.02204 1.00029 3159 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

and the posterior distribution of the parameters:

plot(m, "hist")
Output

We can also explore results of the MCMC via a built-in Shiny app, launched by the launch_shinystan command:

launch_shinystan(m)

This provides a huge amount of detail, much more than you need just now; the key things to check are convergence and the parameter distributions.
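If you prefer working with the raw draws outside the Shiny app, the posterior samples can be extracted directly (a sketch using standard rstanarm accessors):

```r
# One row per posterior draw; columns (Intercept), basal_area, sigma
draws <- as.data.frame(m)
head(draws)

# 95% credible intervals for each parameter, computed from the draws
posterior_interval(m, prob = 0.95)
```

Working directly with the draws makes it easy to compute derived quantities, e.g. the posterior probability that the slope exceeds 1: `mean(draws$basal_area > 1)`.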

Examine the model fit with summary(m) and launch_shinystan(m).

Questions
  1. Have the MCMC chains converged?
    Hints Visually inspect the trace plots for systematic trends or divergence between chains. Is the Gelman-Rubin statistic Rhat close to 1?
  2. Do the parameter distributions look “sensible”?
  3. On average, how accurate is the model?
    Hints Check the value of \(\sigma\) (sigma) in the summary output. Examine the posterior distribution of sigma.
Check predictions
Plot the predictions yourself or with the code below. The ggpredict function can be used to extract the model predictions and intervals for plotting with ggplot.
Code
df_pred <- as.data.frame(ggpredict(m, terms = "basal_area [0:2400]"))
p <- ggplot(df_pred, aes(x, predicted))
p <- p + geom_ribbon(aes(ymin = conf.low, ymax = conf.high), fill = "pink")
p <- p + geom_line()
p <- p + geom_point(data = df, aes(basal_area, tree_mass))
p

We can create 95% prediction intervals by getting the appropriate quantiles of the posterior samples. Taking a mid-range basal area of 1000 cm² (dbh ~ 35 cm):

y_rep <- posterior_predict(m, newdata = data.frame(basal_area = 1000))
quantile(y_rep, c(0.025, 0.5, 0.975))
     2.5%       50%     97.5% 
 830.1931  995.7315 1167.6193 
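Note the distinction between this prediction interval (uncertainty about an individual tree) and a credible interval for the mean response at the same basal area. The latter can be sketched from draws of the expected value (posterior_epred is available in recent rstanarm versions):

```r
# Expected (mean) tree mass at basal_area = 1000 cm^2; the interval
# for the mean is much narrower than the prediction interval for a
# single tree, because it excludes the residual variation sigma
mu_rep <- posterior_epred(m, newdata = data.frame(basal_area = 1000))
quantile(mu_rep, c(0.025, 0.5, 0.975))
```

Comparing the two intervals shows how much of the predictive uncertainty comes from residual scatter rather than from parameter uncertainty.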

We can also check the model predictions by plotting how well they reproduce the observed data. In these plots, \(y\) denotes the observations and \(y_{rep}\) the (replicated) samples from the posterior predictive distribution.

pp_check(m, plotfun = "boxplot", nreps = 10, notch = FALSE)
pp_check(m, plotfun = "hist", nreps = 3)
pp_check(m)
Output
pp_check(m, plotfun = "boxplot", nreps = 10, notch = FALSE)

pp_check(m, plotfun = "hist", nreps = 3)

pp_check(m)

Questions
  1. How satisfactory is the model? How might it be improved?

Further things to explore

Variation among species

Is the model likely to be improved if we account for variation among species?
Code
p <- ggplot(df, aes(basal_area, tree_mass, colour = species))
p <- p + geom_point()
p <- p + stat_smooth(method = "lm")
p
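One way to account for species in the Bayesian model is an interaction term, giving each species its own intercept and slope (a sketch; a hierarchical model fitted with stan_glmer would be another option, and the comparison via sigma below is only a crude check):

```r
# Separate intercept and slope per species via an interaction
m_sp <- stan_glm(tree_mass ~ basal_area * species, data = df, refresh = 0)

# A smaller residual standard deviation than the single-line model
# suggests the species terms explain additional variation
summary(m_sp, pars = "sigma")
```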

Multivariate version

Is the model improved by including additional variables?
Code
m_multi <- stan_glm(tree_mass ~ basal_area +
  age + species + cultivation + soil_type + tree_height + timber_height,
  data = df, refresh = 0)
summary(m_multi)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      tree_mass ~ basal_area + age + species + cultivation + soil_type + 
       tree_height + timber_height
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1881
 predictors:   34

Estimates:
                mean   sd     10%    50%    90% 
(Intercept)   -203.4   13.9 -221.1 -203.3 -186.4
basal_area       0.8    0.0    0.8    0.8    0.8
age             -0.2    0.4   -0.7   -0.2    0.2
speciesDF       24.2   15.2    4.6   24.3   44.0
speciesEL        0.7   16.9  -21.2    1.1   22.3
speciesGF      -16.2   14.2  -34.3  -16.3    2.3
speciesJL      -14.2   13.1  -30.6  -14.2    2.4
speciesLC       51.5   27.8   16.0   51.3   87.2
speciesLP       19.2    9.6    7.0   19.2   31.3
speciesNF       43.4   18.6   20.0   43.5   67.2
speciesNS       29.1   10.3   15.9   29.1   42.1
speciesRC      -58.5   24.6  -90.0  -58.7  -27.1
speciesSP        5.2    9.9   -7.2    5.4   17.9
speciesSS       32.5    8.7   21.4   32.5   43.6
speciesWH       14.0   13.2   -3.2   14.0   31.0
cultivation     -0.1    0.8   -1.1   -0.1    0.9
soil_type10     -8.9   17.3  -31.4   -8.5   13.4
soil_type11    -67.3   16.3  -88.7  -67.0  -46.9
soil_type12    -59.6   16.6  -80.8  -59.8  -38.5
soil_type13    -92.9   24.3 -124.6  -92.7  -61.4
soil_type1a     -1.5   24.3  -32.4   -1.0   28.9
soil_type1g     -9.0   14.6  -27.7   -8.9    9.6
soil_type1x    -18.3   18.0  -41.3  -18.2    4.3
soil_type3       1.0    7.1   -8.1    0.9    9.9
soil_type4       3.1    7.8   -6.7    3.1   13.0
soil_type4b     -2.7   10.0  -15.6   -2.6   10.0
soil_type5b    -21.0   11.5  -35.6  -21.0   -6.6
soil_type6     -10.6    5.9  -18.3  -10.4   -3.1
soil_type7       8.9    6.1    1.0    8.8   16.8
soil_type7b   -241.9   12.9 -258.3 -242.2 -225.6
soil_type8       1.0   24.3  -29.2    0.6   32.5
soil_type9      -4.9    8.4  -15.8   -4.8    5.8
tree_height      5.2    2.0    2.7    5.2    7.9
timber_height   13.3    2.1   10.6   13.4   16.1
sigma           65.6    1.1   64.3   65.6   67.0

Fit Diagnostics:
           mean   sd    10%   50%   90%
mean_PPD 325.4    2.1 322.6 325.3 328.1

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.3  1.0  1694 
basal_area    0.0  1.0  3604 
age           0.0  1.0  2410 
speciesDF     0.3  1.0  2066 
speciesEL     0.4  1.0  2066 
speciesGF     0.3  1.0  1849 
speciesJL     0.3  1.0  2012 
speciesLC     0.4  1.0  4161 
speciesLP     0.3  1.0  1277 
speciesNF     0.3  1.0  3012 
speciesNS     0.3  1.0  1265 
speciesRC     0.4  1.0  3328 
speciesSP     0.3  1.0  1488 
speciesSS     0.3  1.0  1107 
speciesWH     0.3  1.0  1719 
cultivation   0.0  1.0  4310 
soil_type10   0.3  1.0  4329 
soil_type11   0.2  1.0  4397 
soil_type12   0.3  1.0  3695 
soil_type13   0.4  1.0  4339 
soil_type1a   0.3  1.0  5209 
soil_type1g   0.2  1.0  4922 
soil_type1x   0.3  1.0  4635 
soil_type3    0.2  1.0  2191 
soil_type4    0.2  1.0  1676 
soil_type4b   0.2  1.0  2816 
soil_type5b   0.2  1.0  2661 
soil_type6    0.1  1.0  1669 
soil_type7    0.1  1.0  2099 
soil_type7b   0.2  1.0  2921 
soil_type8    0.3  1.0  5038 
soil_type9    0.2  1.0  2375 
tree_height   0.0  1.0  1655 
timber_height 0.1  1.0  1573 
sigma         0.0  1.0  2710 
mean_PPD      0.0  1.0  3473 
log-posterior 0.1  1.0  1484 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Comparison with least-squares fitting

Compare the results with those from a standard least-squares fitting approach. They should give very similar answers in this simple case.

Code
m_lm <- lm(tree_mass ~ basal_area, data = df, na.action = na.exclude)
summary(m_lm)

Call:
lm(formula = tree_mass ~ basal_area, data = df, na.action = na.exclude)

Residuals:
     Min       1Q   Median       3Q      Max 
-1029.19   -29.82     3.11    32.81   591.73 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -59.142595   3.953655  -14.96   <2e-16 ***
basal_area    1.053948   0.009336  112.89   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 87.94 on 1899 degrees of freedom
Multiple R-squared:  0.8703,    Adjusted R-squared:  0.8702 
F-statistic: 1.274e+04 on 1 and 1899 DF,  p-value: < 2.2e-16
hist(resid(m_lm))

predict(m_lm, newdata = data.frame(basal_area = 1000), interval = 'prediction')
       fit      lwr      upr
1 994.8051 821.8989 1167.711

Specifying priors

A Bayesian analysis considers what prior knowledge is available about the model and parameters, and combines this with the available data.

We did this using the default weakly informative priors, which specify wide (but not uniform) distributions for all parameters. Setting prior_PD = TRUE makes the model sample from the prior alone (ignoring the data), and prior_summary shows the prior distributions used:

m <- stan_glm(tree_mass ~ basal_area, data = df, prior_PD = TRUE, refresh = 0)
Code
prior_summary(m)
Priors for model 'm' 
------
Intercept (after predictors centered)
  Specified prior:
    ~ normal(location = 325, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 325, scale = 610)

Coefficients
  Specified prior:
    ~ normal(location = 0, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 0, scale = 2.8)

Auxiliary (sigma)
  Specified prior:
    ~ exponential(rate = 1)
  Adjusted prior:
    ~ exponential(rate = 0.0041)
------
See help('prior_summary.stanreg') for more details

Could we do better than this?
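As a sketch of how more informative priors could be supplied: basic geometry and wood density suggest the slope should be positive and of order 1 kg/cm², and we could encode that via the prior arguments of stan_glm (the specific locations and scales below are illustrative assumptions, not recommendations):

```r
m_informative <- stan_glm(
  tree_mass ~ basal_area,
  data = df,
  prior           = normal(location = 1, scale = 0.5),  # slope: positive, near 1 kg/cm^2 (assumed)
  prior_intercept = normal(location = 0, scale = 100),  # intercept: near zero, loosely constrained (assumed)
  prior_aux       = exponential(rate = 0.01),           # sigma: broad prior (assumed)
  refresh = 0
)
prior_summary(m_informative)
```

With 1901 observations the data dominate and the posterior will barely change, but with small samples a sensible prior like this can matter a great deal.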

Further reading

References

Levy, P. E., S. E. Hale, and B. C. Nicoll. 2004. “Biomass Expansion Factors and Root: Shoot Ratios for Coniferous Tree Species in Great Britain.” Forestry 77 (5): 421–30. https://doi.org/10.1093/forestry/77.5.421.
West, Geoffrey B., James H. Brown, and Brian J. Enquist. 1997. “A General Model for the Origin of Allometric Scaling Laws in Biology.” Science 276 (5309): 122–26. https://doi.org/10.1126/science.276.5309.122.

Footnotes

  1. Biological scaling relationships apply to a wide range of morphological traits (e.g. brain size and body size), physiological traits (e.g. metabolic rate and body size in animals) and ecological traits. Some general principles seem to apply, probably because of “how essential materials are transported through space-filling fractal networks of branching tubes” (West, Brown, and Enquist 1997).↩︎

  2. “stan” is a computer language, “ARM” is a book: “Applied Regression Modelling”.↩︎

  3. The default algorithm is Hamiltonian MCMC, but the details are not important just now.↩︎